ABSTRACT

Common bus, shared memory multiprocessors are the most
widely used parallel processing architectures. Unfortunately, such
systems suffer from a memory/bus bandwidth limitation problem.

For the designer of a hybrid optical/electronic supercomputer, an
immediate temptation is to replace the shared electronic bus with an
optical analog of higher bandwidth. That replacement, however, is
only a partial solution. The true bottleneck in such systems is in the
address decoding circuits of the shared memory units.

In this paper we propose a new memory structure which provides for
parallel access in a multiprocessor environment. The proposed system
has two advantages. First, it distributes the address decoding circuitry
to each of the requesting units on a common bus, thus eliminating
the bottleneck of centralized decoding of encoded memory
addresses. Second, it allows for parallel fetches of memory data with
a level of parallelism limited only by the ratio of optical to electronic
bus bandwidths and the dimensionality of the memory array.
The design requires no active optical switching devices and can be
implemented using the mature technologies of optical sources,
waveguides, and detectors. The two-dimensional version of the device
is well suited to an integrated optics implementation.

1. Introduction

In a conventional electronic memory circuit, like the one shown in Figure 1, an
incoming memory address is divided into row and column addresses, each decoded
separately. The selected memory location is the intersection of the select lines
generated by the row and column decoders. In common bus multiprocessors these
decoders have traditionally been a performance limiting bottleneck. Each decoder
can process only a single encoded address, thus limiting memory access to a single
location. Memory interleaving techniques[1], which subdivide the memory space
into regions, each in a separate memory unit, are commonly applied in an attempt
to make parallel some subset of memory accesses. More recently, sophisticated
cache memory systems[2] have been developed which physically reproduce portions
of shared memories in a local store. Both systems have obvious limitations.
Interleaved systems impose an ordering in which parallel accesses to a shared
memory may be made, and cache memories rely on the locality of memory references
for each processor and require a large overhead to support cache coherence.

Our solution is to distribute the address decoding function to the requesting devices,
thereby breaking contention for monolithic address decoders. This solution
requires the abandonment of conventional encoded addresses as a mechanism for
conserving bus bandwidth. Rather, we will use the high bandwidth of optics to
time multiplex fully decoded addresses into an optical "select" pulse train. Using a
technique based on the coincidence of optical pulses, the optical select pulse train
can be directly applied to a memory array to address one or more cells. Effective
parallelism is possible in this system because of the differential between optical and
electronic bandwidth. Within a single electronic memory access cycle, $N$ parallel
memory references are possible, where $N$ is limited only by the ratio of optical to
electronic bandwidths. For a fixed bandwidth ratio, the size of memory which can
be constructed is further determined by the dimensionality of the memory structure.
The technique requires no active optical or electro-optical switching devices.
It uses only the mature technologies of optical sources, waveguides, and photodetectors.
In the two-dimensional form, the system is well adapted to an
integrated optics implementation[3, 4].

Figure 1: Conventional Electronic Memory Structure

This research has been supported, in part, under Office of Naval Research contract N00014-85-K-0339.

The addressing mechanism, which we call "optical pulse delay modulation", is
based on the use of time delays between optical pulses. The optical pulses are propagated
through waveguides in several directions through the memory array. By
appropriately adjusting the delays, these pulses can be made to coincide at specific
memory cells. This coincidence is detected by photodetectors at the addressed locations,
thereby selecting those locations for memory access.

Our primary interest is in the application of this addressing mechanism to
two-dimensional, multi-ported memory modules. Such structures are composed of
horizontal and vertical waveguides with $n$ memory cells located at the intersecting
points. With proper cell layout, up to $2\sqrt{n}$ memory cells may be accessed
concurrently by sending a sequence of pulses in the horizontal and vertical
waveguides. In a multiprocessor environment, a sequence of pulses, each
corresponding to a distinct memory reference, is generated by independent address
decoders located at each of the processing units. Therefore, the address decoding
function is completely distributed to the requesting processors and there is no
address decoding circuitry at the memory unit.

This paper is organized as follows. In section two we introduce coincident pulse
addressing by an example using a one-dimensional memory structure. Included in
this section is a discussion of techniques for selecting memory locations, generating
parallel addresses, and memory data output. In section three we expand the technique
to a more realistic two-dimensional structure and address such issues as
memory access conflicts, complexity, and organization. Suggested systems for
memory write control are presented in section four. In the concluding section, we
discuss higher dimensional structures, implementation issues, and the research
issues which must be addressed to physically realize such a system.

2.  A  One  Dimensional  Memory  Array 

In this section, we introduce the technique of pulse delay addressing using a
one-dimensional memory array as an example. This example is not ideal because both
the hardware complexity and access time grow in proportion to the size of the
memory. This is not the case for the higher dimensional structures presented in
later sections.

2.1.  Optical  pulse  delay  addressing 

As shown in Figure 2, a memory module is composed of $n$ cells $C_1, \ldots, C_n$, each
storing one bit of information. The select signal for each cell $C_k$ is an electronic
pulse at the output of a photodetector $D_k$. The photodetector generates the logical
OR of two incident optical signals, denoted in Figure 2 by $s_1$ and $s_2$.

Figure 2: Linear Memory Structure

The signals $s_1$ and $s_2$ travel in opposite directions along an optical path, which can
be either an optical fiber or a planar waveguide in an integrated optical device[5].
Photodetectors are placed at fixed distance intervals $d$ along the optical path, and
two laser diodes, $L_1$ and $L_2$, are coupled to each end. Both laser diodes are normally
on and the circuits of all detectors normally generate a logic one.

Assume that two dark pulses of duration $\tau$ are transmitted, one from $L_1$ and the
other from $L_2$, at times $t_1$ and $t_2$, respectively. These pulses represent "dark" spots
propagating at speed $c_g$ (the speed of light in the waveguide). By carefully selecting
the delay between $t_1$ and $t_2$, the two dark spots can be made to meet at exactly
one detector. This detector will then turn off, generating a logic zero of duration
$\tau$. The distance $d$ between any two adjacent detectors is chosen to be $d = \tau c_g$, the
propagation distance corresponding to the pulse duty cycle. The delay $t_1 - t_2$ is also
chosen such that it corresponds to an even multiple of $d$ in propagation distance.
More specifically, if

$$t_1 - t_2 = (n - 1 - 2(k - 1))\,\tau \qquad (1)$$

then the two dark spots will meet at detector $D_k$, thus addressing cell $k$. For
example, when $n = 5$, if $L_2$ generates its dark pulse $2\tau$ seconds before $L_1$ generates
its pulse, then (1) gives $k = 2$, that is, the two pulses meet at $D_2$. Similarly, if $L_2$
generates its pulse $2\tau$ seconds after $L_1$ generates its pulse, then the two pulses meet
at $D_4$. Clearly, the middle cell is chosen by generating the two pulses simultaneously,
that is, having $t_1 = t_2$. Therefore, the address of the cell is encoded using the
delay $t_1 - t_2$. In this view, the pulse generated by $L_1$ is treated as the reference
pulse and the pulse generated by $L_2$ becomes a select pulse. In the remaining discussion,
the names $t_{ref}$, $L_{ref}$, $t_{sel}$, and $L_{sel}$ will refer to $t_1$, $L_1$, $t_2$, and $L_2$, respectively.
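The mapping in equation (1) between a cell index and the pulse delay can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `delay_for_cell` and `cell_for_delay` are hypothetical, and delays are expressed as integer multiples of $\tau$.

```python
def delay_for_cell(k: int, n: int) -> int:
    """Return (t_1 - t_2) / tau that makes the dark pulses meet at D_k, per equation (1)."""
    if not 1 <= k <= n:
        raise ValueError("cell index k must be in 1..n")
    return n - 1 - 2 * (k - 1)


def cell_for_delay(delay: int, n: int) -> int:
    """Invert equation (1): return the detector index k at which the pulses meet."""
    twice_offset = n - 1 - delay          # equals 2 * (k - 1)
    if twice_offset % 2 != 0:
        raise ValueError("delay does not correspond to a detector position")
    k = twice_offset // 2 + 1
    if not 1 <= k <= n:
        raise ValueError("delay is outside the range of the array")
    return k
```

For the five-cell example in the text, `delay_for_cell(2, 5)` gives a delay of two pulse widths and `delay_for_cell(3, 5)` gives zero, matching the simultaneous-pulse case for the middle cell.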

The memory access time is determined by the maximum delay needed to address
any cell in the array. From (1), it is clear that for $k = 1, \ldots, n$, we have

$$-(n - 1)\,\tau \le t_{ref} - t_{sel} \le (n - 1)\,\tau \qquad (2a)$$

from which we find that the memory access time, $\sigma$, is given by

$$\sigma = 2 n \tau \qquad (2b)$$

Note that equation (2a) indicates that the select pulse occurs within $n\tau$ before or
after the reference pulse.

The parallelism in this addressing scheme comes from the fact that within time $\sigma$,
it is possible to address more than one cell by sending a series of pulses from $L_{sel}$,
one for each memory reference. Each of these pulses will intersect with the reference
pulse at the desired detector. In other words, parallel memory references are
positionally distinguishable in a pulse train generated by a series of select pulses.
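A small discrete-time model illustrates why several select pulses can share one reference pulse without selecting unintended cells. The sketch below is not taken from the paper: it assumes, for illustration only, that $L_{ref}$ sits one spacing $d$ before $D_1$, that $L_{sel}$ sits one spacing $d$ after $D_n$, and that time is counted in units of $\tau$. The names `select_times` and `selected_cells` are hypothetical.

```python
def select_times(cells, n, t_ref=0):
    """Emission times (in units of tau) of the select pulses, from equation (1)."""
    return [t_ref - (n - 1 - 2 * (k - 1)) for k in cells]


def selected_cells(cells, n, t_ref=0):
    """Return {detector index: time at which its OR output drops to logic zero}."""
    t_sels = select_times(cells, n, t_ref)
    hits = {}
    for k in range(1, n + 1):
        ref_arrival = t_ref + k                 # reference pulse has travelled k spacings
        for t_sel in t_sels:
            sel_arrival = t_sel + (n + 1 - k)   # select pulse comes from the far end
            if sel_arrival == ref_arrival:      # both inputs dark at the same time
                hits[k] = ref_arrival
    return hits
```

Under these assumptions, requesting cells 2 and 4 of a five-cell array selects exactly those two detectors, each at a different time slot.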

In the next two sections we will describe how this memory can be incorporated
into a shared memory multiprocessor. Two design issues at the interface between
the electronic processing units and the optical system must be resolved. First, a
system for generating a series of optical pulses corresponding to the select pulse
train must be specified. Second, the memory must allow data stored in the referenced
locations to be returned to the requesting processors within a single
processor-memory cycle.


2.2. Address Generation

One of the major advantages of the proposed memory organization over conventional
systems is the removal of the address decoding function from the memory
unit, and the distribution of this function to the requesting processors. More
specifically, each of the processors is assumed to generate normal encoded addresses
when referencing the shared memory. These addresses are decoded locally by an
address decoder attached to each processor. The decoded addresses are electronically
OR'ed onto a select bus common to all of the processors. In the
one-dimensional case the select bus consists of $n$ lines, each controlling a laser diode
pulser (see Figure 3). As will be explained in Section 3, the size of the select bus in
the two-dimensional case reduces to a more manageable $2\sqrt{n}$ lines controlling
$2\sqrt{n}$ laser diode pulsers.

Returning to the linear case, the optical pulse train containing the memory requests
is generated by $2n$ laser diode pulsers which are spaced at incremental distances $d$
in optical path length from the memory. In order to reconcile the optical to electronic
bandwidth difference, a single edge in the electronic time-base (denoted
sync 1 in Figure 3) controls the activation of all the pulsers such that all the optical
pulses are generated simultaneously. If the duration of each optical pulse is
equal to $\tau$, then the select pulse train will be confined to $2n$ time slots, each with
duration $\tau$. Since proper time multiplexing of select pulses is not possible with the
select pulsers constantly on, the $n$ select pulsers connected to the electronic select
bus are separated by $n$ pulsers modulated by a fixed one signal (see Figure 3). Thus
all the slots will contain an optical pulse except those slots corresponding to
requested memory addresses. More specifically, a dark spot (no pulse) at slot $2i - 1$
corresponds to a request for memory location $i$.
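The distributed decode, the electronic OR onto the select bus, and the resulting $2n$-slot pulse train can be sketched as follows. This is a minimal illustration, assuming 1-based addresses and representing a light pulse as 1 and a dark slot as 0; the function names are hypothetical and the optical path-length details are not modeled.

```python
def decode(address: int, n: int) -> list:
    """Local decoder at one processor: one-hot select lines for an address in 1..n."""
    if not 1 <= address <= n:
        raise ValueError("address must be in 1..n")
    return [1 if line == address else 0 for line in range(1, n + 1)]


def select_bus(requests, n: int) -> list:
    """OR the decoded addresses of all requesting processors onto the n-line select bus."""
    bus = [0] * n
    for address in requests:
        for idx, bit in enumerate(decode(address, n)):
            bus[idx] |= bit
    return bus


def select_pulse_train(requests, n: int) -> list:
    """Build the 2n-slot select train: light everywhere, dark at slot 2i-1 for each request i."""
    train = [1] * (2 * n)
    for i, requested in enumerate(select_bus(requests, n), start=1):
        if requested:
            train[2 * i - 2] = 0   # slot 2i-1 in the text's 1-based numbering
    return train
```

Two processors requesting the same location simply assert the same select line, so the train contains a single dark slot for that location.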


In addition to the select pulse train, a dark reference pulse must be generated.
Equivalently, the reference pulse is a series of $2n$ light pulses, with a single dark
pulse at position $n$. Such a pulse train may be produced by a single laser diode
which is normally on and which is pulsed off, for duration $\tau$, upon the reception
of the synchronization pulse sync 1. For the dark pulse in the reference train to
coincide with pulse slot $n$ in the select pulse train, the optical path from the reference
diode to the memory should be equal to the optical path from the diode generating
the middle pulse in the select train to the memory.


The above address generation scheme is crucially dependent on the simultaneous
pulsing of several laser diodes. In an integrated optics environment, such synchronization
problems can be avoided by using a single optical pulse source of
duration $\tau$, and a series of electro-optic switches. In such an implementation each
laser diode in Figure 3 is replaced by a directional coupler which "couples in" the
pulse at various optical path lengths. Synchronization problems are replaced in this
system by a new set of issues relating to optical power distribution. We will discuss
this and other issues relating to an integrated optics implementation in later
sections.

Figure 3: Distributed Address Generator

2.3. Parallel Memory Read

Another issue at the interface between the electronic processing units and the optical
system is a mechanism for returning the electronically stored data from the
memory to the processors. The data is returned, on an optical bus, in a pulse train
which consists of $n$ slots, one for each memory location. Therefore, parallel
accesses are positionally distinguishable in the read pulse train. When this pulse
train arrives at a processor which has issued a read request for the $i$th position of
the store, this processor will find the requested data in the $i$th slot of the data
pulse train.

One method for generating the read data pulse train is to use a structure similar to
the one depicted in Figure 4. In this structure, $n$ laser diodes are placed on the optical
data bus separated by an optical distance $d$. When a specific memory location
$k$ is addressed, an electronic signal is generated from the photodetector as described
in Section 2.1. The data at location $k$ is assumed to be stored electronically and is
used to modulate the $k$th laser diode only if location $k$ is addressed. A synchronization
signal, sync 2, is used to synchronize the output of light (positive) pulses of
duration $\tau$ from the selected memory locations which store a one. The difference
in the optical path lengths between the laser diodes ensures the correct generation
of the data pulse train. A similar technique is used to demultiplex the pulse trains
by using detectors at fixed distances $d$, and latching the pulse train at each processor
interface.
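The positional encoding of the read data can be sketched as below. This is an illustrative model only: slots are represented as list entries (1 for a light pulse, 0 for no pulse), cell numbering is 1-based, and the names `read_pulse_train` and `fetch` are hypothetical.

```python
def read_pulse_train(memory, addressed):
    """Return the n-slot data train: slot k carries a light pulse only if
    cell k was addressed in this cycle and stores a one."""
    return [bit if k in addressed else 0
            for k, bit in enumerate(memory, start=1)]


def fetch(train, i):
    """A processor that requested location i reads the i-th slot of the train."""
    return train[i - 1]
```

For a five-cell store holding `[1, 0, 1, 1, 0]` with cells 1, 2 and 4 addressed, the train is `[1, 0, 0, 1, 0]`; cell 3 stores a one but contributes no pulse because it was not addressed.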

3.  A  Two  Dimensional  Memory  Structure 

Using the above mechanism, it is possible to address all of the $n$ memory locations
in one processor-memory cycle. For the one-dimensional case, this represents a
sequential read of the entire store, and requires a ratio of electronic to optical time
bases equal to the size of the memory. Even for the most optimistic assumptions
about achievable optical pulse widths, this structure will be inadequate and wasteful.
However, by applying a similar addressing mechanism to two-dimensional
memory arrays, the required ratio of electronic to optical time bases reduces to
$\sqrt{n}$, where at most $\sqrt{n}$ memory locations may be addressed in one cycle. This
allows for the construction of reasonable size shared memories.

3.1.  Coincident  Wavefront  Addressing 

In the two-dimensional case we generalize the propagation of "dark spots" in one
dimension to the propagation of linear dark wavefronts moving through a series of
parallel waveguides. Hence, the method of addressing a location by programming
the intersection of two dark spots may be generalized to addressing a location in a
two-dimensional array by programming the intersection of three dark wavefronts.
The literature on systolic and wavefront arrays (see, e.g., [6, 7]) suggests many
possible ways for propagating and programming the intersection of wavefronts. In
this section, we present a simple propagation scheme which may be used in
two-dimensional memory addressing.

Consider 2-D memory arrays similar to the one shown in Figure 5. An array of
size $n$ is composed of $\sqrt{n} \times \sqrt{n}$ photodetector/cell units $DC_{i,j}$, $i, j = 1, \ldots, \sqrt{n}$,
separated by a distance $d = \tau c_g$ in both the vertical and the horizontal directions.
The structure of a DC unit is identical to the linear example, except that the input
to the photodetector is the logical OR of three optical signals: a
dark reference wavefront generated by the reference diode $L_{ref}$ and a select pulse
train generated from a distributed set of column address decoders, $L_{col}$, both traveling
horizontally in opposite directions, and a select pulse train generated by a distributed
set of row select decoders, $L_{row}$, traveling vertically.

Figure 5: Two Dimensional Memory Structure (labels: row select pulses, col select pulses, priority, data out pulses)

The optical signal generated by each source is decoupled from the source fiber by a
"squid" type connection into $\sqrt{n}$ signals that travel through the array in parallel
waveguides. Since the optical path length of all legs in the squid will be equal, the
wavefront will arrive at all locations in a single row (or column) simultaneously.
For example, an optical pulse generated by $L_{ref}$ and directed horizontally through
the array will be simultaneously incident at locations $DC_{j,i}$, $j = 1, \ldots, \sqrt{n}$, in
column $i$. Similarly, any pulse generated by $L_{col}$ will arrive at all the cells in a
specific column simultaneously, and any pulse generated by $L_{row}$ will arrive at all
the cells in a specific row simultaneously.

In order to derive the equations that govern the intersections of three wavefronts,
assume, as in the case of the linear array, that all three laser diodes are on and that
$L_{ref}$ generates a dark pulse of duration $\tau$ at time $t_{ref}$. Also, $L_{col}$ and $L_{row}$ generate
dark pulses at times $t_{col}$ and $t_{row}$, respectively. If the timing of $L_{col}$ is such
that equation (3) holds, then the two dark wavefronts generated by $L_{ref}$ and $L_{col}$ will meet at column $j$
of the array. In order to select a particular cell $DC_{i,j}$ in that column, the third
dark wavefront, namely the one generated by $L_{row}$, should be crossing row $i$ when
the other two wavefronts meet at column $j$. This may be accomplished by timing
$L_{row}$ such that

$$t_{ref} - t_{row} = (j - i)\,\tau \qquad (4)$$

In other words, to address a certain memory location $i, j$, the column number $j$ is
encoded as $t_{ref} - t_{col}$ and the difference, $j - i$, between the column number and the
row number is encoded as $t_{ref} - t_{row}$. From (3) and (4), it may be shown that

$$-(\sqrt{n} - 1)\,\tau \le t_{ref} - t_{col} \le (\sqrt{n} - 1)\,\tau$$

and

$$-(\sqrt{n} - 1)\,\tau \le t_{ref} - t_{row} \le (\sqrt{n} - 1)\,\tau$$

and hence, the memory access time, $\sigma$, is

$$\sigma = 2 \sqrt{n}\,\tau \qquad (5)$$
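To make the difference between the linear and two-dimensional access times concrete, the following sketch evaluates the access-time expressions $2n\tau$ and $2\sqrt{n}\tau$ side by side. It assumes, for illustration, that $n$ is a perfect square; the function names are hypothetical.

```python
import math


def access_time_1d(n, tau):
    """Access time of the linear array: 2 * n * tau."""
    return 2 * n * tau


def access_time_2d(n, tau):
    """Access time of the two-dimensional array: 2 * sqrt(n) * tau."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return 2 * side * tau
```

For a 16-location memory the linear array needs 32 pulse widths per access cycle, while the 4 x 4 array needs 8.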

As in the linear case, parallel accesses are possible by generating multiple pulses in
the row and column select signals. For example, Figure 6(a) shows the pulse
trains for the selection of memory locations (2,2), (1,4) and (4,4) in the 16 location
memory array of Figure 6(b). For these three locations, $t_{ref} - t_{col}$ should be equal to
-1, 3 and 3, respectively, and $t_{ref} - t_{row}$ should be equal to 0, 3 and 0, respectively
(in units of $\tau$). The locations of the wavefronts resulting from these trains at times 0, $5\tau$ and $7\tau$ are
shown in Figure 6(b), (c) and (d). From these figures, the intersection of the
dark fronts shows that location (2,2) is selected at time $5\tau$, and locations (1,4) and (4,4)
are selected at time $7\tau$.
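The row delays quoted in this example follow directly from equation (4), as the short sketch below shows. It computes only the row timing; the column delays are taken from the text as given and are not recomputed here. The function name `row_delay` is hypothetical and locations are written as (row, column) pairs.

```python
def row_delay(i: int, j: int) -> int:
    """Return (t_ref - t_row) / tau for cell (i, j), per equation (4)."""
    return j - i


# The three locations of the worked example, as (row, column) pairs.
locations = [(2, 2), (1, 4), (4, 4)]
row_delays = [row_delay(i, j) for i, j in locations]

# Locations on the same diagonal (equal j - i) share one dark pulse in the row train.
distinct_row_pulses = sorted(set(row_delays))
```

Here locations (2,2) and (4,4) have the same value of $j - i$, so only two distinct dark pulses are needed in the row select train for the three references.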

Using the above scheme, it is possible to encode the addresses of all of the $n$
memory cells in the column and row pulse trains during a single memory access
cycle. However, for a time multiplexed memory read such as the one proposed in
section 2.3, the length of the return data pulse train, and hence the total read time,
would grow linearly with memory size. To prevent this, and to facilitate a pipelined
implementation, the maximum length of the read pulse train is chosen to be
$2\sqrt{n}$, the length of the select pulse trains. This is actually an advantage of the
two-dimensional structure. More specifically, in the one-dimensional case, the
memory cycle time was directly proportional to the size of the store. Each cycle
needed to provide an optical time-base slot for each location. In the
two-dimensional case, access time, and the possible number of parallel accesses, is
proportional to the square root of the number of locations in the store. This is a far
more realistic scenario for constructing a shared memory multiprocessor system.
The price paid for this reduction in access time is the potential for conflicts in
parallel memory accesses. In the next section, we suggest a technique which allows
for at most $\sqrt{n}$ distinct memory references per cycle.


3.2.  Resolution  of  conflicts  in  memory  access 


There are two policies that may be applied to limit the number of memory references
during a given cycle to $\sqrt{n}$. The first policy is to allow only $\sqrt{n}$ addresses
to reach the memory module during the cycle, and the second is to allow as many
as $n$ addresses to reach the memory but return only the contents of $\sqrt{n}$ of these
addresses. The first policy requires active optical switching devices to resolve
conflicts in the incoming select pulse trains. To avoid the need for such devices, we
choose to implement the second policy, which allows full addressing and prioritizes
referenced locations for conflict resolution.


The same data collection circuitry used in the linear array is used to collect data
for each column of the two-dimensional array. One waveguide is dedicated to the
collection of the contents of the addressed cells in each column. The signals in the
$\sqrt{n}$ waveguides are merged into a single waveguide (denoted "data out pulses" in
Figure 5) which returns the data to the processors. The lengths of the waveguides
are adjusted such that the optical paths between any two cells $(i, j)$ and $(i, j')$ in the
same row, $i$, and the merging point are equal.

With this collection mechanism, the content of any referenced cell $(i, j)$ in column
$j$ will appear in the $(\sqrt{n} - i + 1)$th time slot on the waveguide of column $j$. However,
when the $\sqrt{n}$ pulse trains corresponding to the $\sqrt{n}$ columns are merged in
the data out waveguide, the data produced by any two cells $(i, j)$ and $(i, j')$ in the
same row and different columns will collide. Since we have elected not to provide
a mechanism for preventing conflicting addresses in the input pulse train, conflict
resolution must be built into the memory array. That is, requests for memory
references in the same row may be allowed, but only one request should be
satisfied. Two problems arise for such a system. First, a mechanism to allow only
one cell per row to output its data must be devised. Second, a method is needed for
announcing which of the conflicting requests have been satisfied.


We start by resolving the second issue. The discussion to this point has assumed a
single bit memory. In fact, memory locations will contain an entire word of $W$
bits, stored electronically and returned in parallel on $W$ optical data out lines. If
each memory location is tagged with its column number, then each processor may
read the column address along with the data and use the data only if the address
coincides with its request. If not, the processor must re-issue the memory request.
For a memory of size $n$, $\log(\sqrt{n})$ tag bits are needed, which increases the number
of bits stored in each memory location from $W$ to $W + \log(\sqrt{n})$.
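The storage overhead of the column tag can be estimated with a short calculation. The sketch below assumes, for illustration, that the logarithm is base 2, that $n$ is a perfect square, and that the tag width is rounded up to a whole number of bits; the function names are hypothetical.

```python
import math


def tag_bits(n: int) -> int:
    """Bits needed to tag a word with its column number, i.e. ceil(log2(sqrt(n)))."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return (side - 1).bit_length()   # smallest b such that 2**b >= side


def stored_width(word_bits: int, n: int) -> int:
    """Total bits stored per location: W data bits plus the column tag."""
    return word_bits + tag_bits(n)
```

For example, a 16-location array needs 2 tag bits per word, and a store of $1024 \times 1024$ words of 32 bits would hold 42 bits per location under these assumptions.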

The first of the conflict resolution issues is the more difficult. To prevent bus contention
between conflicting requests, we have chosen a priority system based on an
optical priority chain, like the one shown in Figure 7. This figure shows a single
row of a two-dimensional array. The optical distance from each cell in this row to
the data output waveguide is equal and hence any parallel accesses within this row
will conflict. In order to avoid this conflict, only one of the optical sources along
this row may be allowed to generate data. In Figure 7a, the horizontal waveguide
connecting all cells in the row forms an optical priority chain to resolve these
conflicts.

Figure 7 - Conflict Resolution Strategy

Figure 7b shows a diagram of a typical memory cell. The optical OR output from
the pulse sensing photodetector sets the select latch. This output gates the contents
of the memory cell through the three input electronic output-control-NOR gate.
The third input to this NOR gate is the priority control. This signal is the output of
a photodetector which senses select signals for higher priority cells indicated optically
on the priority chain waveguide. The local select signal is also used to turn on
a laser diode, which injects light into the priority chain waveguide to indicate its
selection to lower priority cells. Finally, the synchronization signal, sync 2,
ensures that all data out pulsers are activated simultaneously.
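The combined effect of the priority chain and the column tag can be modeled at the logical level, as in the sketch below. It is only an illustration: it assumes that a lower column number means higher priority (the text states only that the rightmost column is lowest in the chain), it ignores all timing, and the name `resolve_cycle` is hypothetical.

```python
def resolve_cycle(requests):
    """Resolve one read cycle.

    requests maps a processor name to the (row, column) it addressed.
    Returns (served, retry): processors whose request matches the returned
    column tag, and processors that must re-issue their request.
    """
    winners = {}
    for row, col in requests.values():
        # The highest-priority selected cell in each row blocks the others.
        winners[row] = min(col, winners.get(row, col))
    served = {p for p, (row, col) in requests.items() if winners[row] == col}
    retry = set(requests) - served
    return served, retry
```

With requests for cells (1,2), (1,4) and (3,4), the two requests in row 1 conflict: the cell in column 2 outputs its data, and the processor that asked for (1,4) sees a tag that does not match and retries in a later cycle.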

From the above description, it is clear that every memory read cycle is divided into
three stages. In the first stage, the select pulse trains are generated and propagated
through the memory array. The minimum time required to complete this stage is
equal to $2\sqrt{n}\,\tau$.

In the second stage, the read operation is propagated through the memory cell electronics.
Note that in Figure 5, the priority chain waveguides are parallel to the
reference pulse waveguides. In this arrangement, the wavefront which encodes
current priority propagates through the priority chain waveguides during the first
stage of the pipeline. The priority is delayed relative to the reference pulse by time

$$t_{pd} = t_s + \tau_l$$

where $t_s$ is the switching delay of the detector/latch circuit and $\tau_l$ is the turn on
time for the priority-out laser diode. At each cell in the rightmost column of Figure 5,
which are the last cells to see the reference pulse and thus the lowest in the
priority chain, the priority input signal arrives at the output control gate at time
$t_{pd} + \tau_d$ relative to the arrival of the reference pulse. ($\tau_d$ is the response time of the
priority-in detector.) Since the select signal arrives at the output control gate in
time $t_s$, the critical timing path for the second stage of the pipeline is

where $t_g$ is the output control gate switching time. By noting that $\tau_l$ and $\tau_d$ must
be less than or equal to the pulse width $\tau$, we can place the lower bound on second
stage pipeline delay at

$$t_s + t_g + \tau$$

Finally, in the third stage of the pipeline, data is returned from the memory to the
requesting processors in a pulse train of $\sqrt{n}$ bits. Thus, the minimum time
required for the third stage is $\sqrt{n}\,\tau$. Assuming that the memory size and
the ratio of electronic to optical bandwidths are sufficient to satisfy

$t_s + t_g < (2\sqrt{n} - 1)\,\tau$,

the longest of the stages is the first. Using such a three stage pipeline the total
memory cycle length is then $6\sqrt{n}\,\tau$. Since we are accessing the memory in a
pipelined fashion, and each stage can process $\sqrt{n}$ references, the effective
memory bandwidth limit is $1/(2\tau)$ words/second.
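
The stage times above can be checked numerically. The following is a minimal sketch, not part of the original design: the function name `pipeline_timing` and the sample values (n = 1024 words, a 1 ps pulse width, 20 ps for each of $t_s$ and $t_g$) are illustrative assumptions chosen only to exercise the formulas.

```python
import math

def pipeline_timing(n, tau, t_s, t_g):
    """Stage times (seconds) for the three-stage read pipeline of a
    sqrt(n) x sqrt(n) array, using the bounds stated in the text."""
    root_n = math.sqrt(n)
    stage1 = 2 * root_n * tau      # select pulse trains generated and propagated
    stage2 = t_s + t_g + tau       # lower bound for the cell-electronics stage
    stage3 = root_n * tau          # data returned as a train of sqrt(n) bits
    slowest = max(stage1, stage2, stage3)  # every stage is clocked at the slowest
    return {
        "stages": (stage1, stage2, stage3),
        "first_stage_dominates": t_s + t_g < (2 * root_n - 1) * tau,
        "cycle": 3 * slowest,              # three stages per memory cycle
        "bandwidth": root_n / slowest,     # sqrt(n) references per stage time
    }

# Assumed, illustrative parameters (not taken from the paper).
result = pipeline_timing(n=1024, tau=1e-12, t_s=20e-12, t_g=20e-12)
print(result)
```

With these assumed values the first stage is the longest, so the cycle equals $6\sqrt{n}\,\tau$ and the bandwidth equals $1/(2\tau)$, independent of n.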

Finally, it should be mentioned that it is possible to support $2\sqrt{n}$ memory
references per cycle, rather than merely $\sqrt{n}$, by rearranging the data collection
waveguides of Figure 5. If the data collection waveguides are run diagonally,
$2\sqrt{n}$ waveguides can be accommodated at the price of a more complex and
unevenly distributed conflict resolution scheme.


3.3.  Organization  of  Memory  Modules 

In the above discussion, we have concentrated on select and read mechanisms
assuming a one bit memory. In an $n \times W$ bit memory module we would reproduce
W copies of the memory cell, the output-control-NOR gate, and the data output
pulser of Figure 7b. For reading, only one decoder-select plane and one priority
chain waveguide are necessary for each of the $W + \log(\sqrt{n})$ bit words in an n
word array. The only optical system which must be scaled with the number of bits in
the word is the read output pulsers. In the same manner as an electronic memory,
we must provide a separate return path for each bit.


4.  Memory  Write  Control 

For a conventional memory, support for write operations would require an additional
control signal and a secondary data path for incoming data. Merely providing
these additional signals in a parallel memory will not be adequate, since
mixed parallel reads and writes must also be resolved.
There are three possibilities for resolving this issue which trade off write access
time for optical circuit complexity.

1)  Exclusive  write:  in  this  solution  we  eliminate  the  possibility  of  mixed 
read/write  operations  and  conflicting  write  operations.  By  implementing  an 
external  arbitration  mechanism,  we  allow  only  exclusive  write  access  to  the 
memory.  Once  a  single  processor  is  selected,  it  can  perform  writes  to  the 
memory  using  conventional  electronics.  This  system  requires  no  additional 
optics  at  the  cost  of  non-parallel  writes.  If  the  ratio  of  writes  to  reads  for  the 
shared  memory  structure  is  relatively  low,  then  exclusive  write  access  may 
represent  a  viable  low  complexity  alternative. 

2) Full parallel read/write: for full parallel optical writes, it is not actually
necessary to provide a separate optical write data bus. Instead, by combining control
and data information, in lieu of a data bus, it is possible to provide fully
parallel non-conflicting writes to any of the n locations in the store.

In this technique, each memory bit sees two bits of select information in each
cycle, one bit from the read select optical circuit already described, and a second
from a per-bit local copy of that selection circuit used for write control.
These two bits encode four states: read, write a zero, write a one, and do nothing
(no select). Thus, by reproducing at each bit in the word a second address
selection structure, and by judicious selection of code assignments for the four
states, the cost of this system is limited to the addition of one optical selection
plane for each bit.

This  technique  allows  fully  mixed  read  and  parallel  writes.  Any  processor  can 
write  to  any  word  with  no  conflict  restrictions  on  the  rows  or  columns  of  write 
addresses.  However,  there  is  no  conflict  resolution  mechanism  to  prevent  two 
processors  from  writing  to  the  same  address.  As  with  a  multi-ported  electronic 
memory,  such  operations  would  generate  unpredictable  results.  It  is  assumed 
that  these  mutual  exclusion  issues  would  be  addressed  by  appropriate  software. 

3) Bit Serial Parallel Write: At a cost of lengthening the overall write time, it is
possible to reduce the overhead for the full parallel read/write system to one
additional optical selection plane by using a bit serial approach. In this system, a
designated cycle initiates a W bit serial write. The select signals generated by
the read selection circuit during this cycle are latched separately, and held for
the duration of the serial write. Each subsequent cycle uses the write selection
circuit to serially process each bit in the word. Meanwhile, parallel reads are
still possible, concurrent with the serial write. A global counter/decoder
circuit, which defines the "current-bit" to be one of the W data bit planes of the
memory, is necessary. It is the only additional overhead for this system.

5.  Extensions  and  Future  Research 

We have concentrated in this presentation on the details of a two dimensional
memory array because of its suitability to integrated optical implementations.
There is nothing in the design that prevents the linear wavefronts in two dimensional
arrays from being generalized to planar waves in three dimensional arrays.
In general, for m-dimensional memory arrays, the access time is reduced to the
mth root of the memory size. Hence, for a fixed bandwidth in the electronic system,
the bandwidth requirements for the optical system are substantially reduced.
This reduction is gained at the price of a corresponding reduction in the number of
locations which can be referenced in parallel during a single electronic cycle. Like
the access time, this number is also reduced to the mth root of the memory size.
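
The trade-off between dimensionality and parallelism can be illustrated with a short calculation. This is a sketch under stated assumptions: the helper name `side_length` and the 4096-word memory size are hypothetical, and the result is reported only as the mth root of the memory size, since the text does not give the constant factors for m greater than two.

```python
def side_length(n, m):
    """Side of an m-dimensional array holding n words.
    n must be a perfect m-th power; otherwise a ValueError is raised."""
    side = round(n ** (1.0 / m))   # round() absorbs floating-point error
    if side ** m != n:
        raise ValueError(f"{n} is not a perfect {m}-th power")
    return side

# Assumed example: a 4096-word memory organized in 2, 3, and 4 dimensions.
for m in (2, 3, 4):
    s = side_length(4096, m)
    print(f"m={m}: access time scales with {s}, "
          f"parallel references per cycle scale with {s}")
```

Both quantities shrink together as m grows, which is the trade described above: a shorter optical access time is bought with fewer parallel references per electronic cycle.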

Extensions of this technique can also be applied to other two-dimensional switching
structures. For example, the control information for a crossbar switch can be
encoded into pulse trains and used to address specific switches in the crossbar.
Pulse delay encoded control information can be prepended to incoming packets to
combine routing information and data without the need for optical-optical switching
devices.

The  question  is:  Can  such  a  memory  be  built,  and  will  its  size  and  performance 
make  it  feasible  for  integration  in  computer  systems  of  the  next  decade  and 
beyond? The following issues must be examined.

1) Scalability: The scale of the physical device is directly related to the optical
pulse width. To minimize the physical spacing of detectors, very short pulses
are required. For instance, to reduce detector spacing to a scale that will allow
monolithic integration, picosecond pulse widths will be required. Specifically,
1 ps pulses will allow a detector spacing of 200 to 300 micrometers depending on the
refractive index of the optical medium. Current commercial technology for
pulsing laser diodes in discrete devices provides for pulses on the order of
100 ps. Recent research has produced optical pulses as short as 8 femtoseconds
[8, 9]. In such research, the common technique for pulse duration measurement
is to split the pulse into two optical paths and detect coincidence when
the paths are recombined at varying optical path lengths. Based on this trend we
expect that the necessary pulse widths for an integrated optical implementation
will be available in the near future. Meanwhile, we are currently using commercial
discrete devices and optical fibers to examine scalability issues.

2) Detection limits: A second limit on usable pulse duration is the issue of detector
technology for coincident light pulses. A two dimensional memory requires
that as many as three dark pulses be detected as they overlap. Even
assuming the existence of photo-detectors of sufficient bandwidth for single
pulse detection, what degree of overlap is required to generate the optical OR of
three pulses? Extending to multi-dimensional structures, fan-in limitations of
this type become even more critical.

3)  Fabrication  vs.  physical  limits:  As  the  system  scales  down  in  size,  and  up  in 
speed,  what  limits  will  be  reached  first,  the  fabrication  limits  of  the  technology 
or  the  physical  limitations  of  the  optical  systems? 

4) Clocking issues: While there is little doubt that sufficiently narrow pulses can
be generated by the electro-optics, the precision to which multiple pulses can be
synchronized is an important question. Two coincident pulses must be timed to
arrive with a precision of +/-10% of their pulse width to allow at least 80%
overlap. This means that the electronic components must gate the optical signals
with a constant delay that is precise to the optical time-base. Clock distribution
issues have been studied extensively in both electronic and optical
domains. We believe that the required precision can be achieved by electronic
circuitry. As an alternative, optical clock distribution techniques such as those
proposed by Clymer and Goodman [10] can be applied.

5) Select Latch Response: The electronic latches at the detector sites must respond
to the selection pulses from optical detectors. These pulses will be no longer
than the duration of the coincident pulses. This is a limitation on the speed of
the optical system.

6) Waveguide Decoupling Technology: In a two-dimensional design it is necessary
to split incoming optical signals into parallel row and column waveguides.
Detectors at each intersection must couple out sufficient optical power for detection,
without significantly degrading the optical signal. Such highly asymmetric
single-mode directional couplers have been developed for optical fiber
and are commercially available [11]. Several other techniques for low power
output coupling have been examined by Jackson et al. [12]. Further work is
needed to apply these techniques in an integrated optic environment.
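
Two of the figures quoted in the list above, the detector spacing in item 1 and the overlap tolerance in item 4, can be reproduced with a short calculation. This is a sketch under assumptions that are not stated in the text: the detector spacing is taken to be the distance light travels in the medium during one pulse width, the pulses are treated as ideal rectangles of equal width, and the function names are hypothetical.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s

def detector_spacing(pulse_width_s, refractive_index):
    """Distance (meters) light travels in the medium during one pulse width.
    Assumption: this distance sets the minimum detector spacing."""
    return C_VACUUM * pulse_width_s / refractive_index

def worst_case_overlap(timing_error_fraction):
    """Worst-case overlap fraction of two equal rectangular pulses when each
    may arrive early or late by timing_error_fraction of the pulse width."""
    return max(0.0, 1.0 - 2.0 * timing_error_fraction)

# 1 ps pulses in media with refractive index 1.0 and 1.5 (assumed values).
for index in (1.0, 1.5):
    spacing_um = detector_spacing(1e-12, index) * 1e6
    print(f"index {index}: spacing of about {spacing_um:.0f} micrometers")

print(f"overlap at +/-10% timing error: {worst_case_overlap(0.10):.0%}")
```

Under these assumptions a refractive index between 1.0 and 1.5 spans roughly the 300 to 200 micrometer range quoted in item 1, and a timing error of 10% on each pulse leaves the 80% overlap quoted in item 4.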

In  summary,  we  have  presented  a  system  which  distributes  the  address  decoding 
function  to  the  requesting  units  on  an  optical  bus.  In  this  system,  addresses  become 
optical  pulse  trains,  and  by  arranging  the  optical  paths  we  provide  a  selection 
mechanism  based  on  the  coincidence  of  these  pulses.  In  the  coming  year  we  plan  to 
begin  construction  of  a  64x16  bit  register  file  based  on  this  research  and  using 
discrete  optical  devices  and  fiber  waveguides.  This  register  file  will  be  used  as 
shared  memory  in  a  prototype  eight  node  multiprocessor. 